Dimensionality Reduction
Transcript

Hello world! It's Siraj. And let's visualise a dataset of eating habits to see what we can learn from it, shall we?

There are some really hard scientific questions out there:
*Are we alone in the universe?
*What is consciousness?
*What is dark matter really?
Questions like these have multi-million dollar payouts and have been troubling scientists for hundreds of years! But guess what - whaaat? - we very likely already have the answers to them! The problem is that they aren't in plain sight - they are hidden in data.

The former leader of the US government effort to sequence the human genome said that it took several years of effort, a huge team of researchers, and fifty million dollars to find the gene responsible for cystic fibrosis in 1989. But that same project could now be accomplished in a few days by a good grad student. You watching this right now, yes you, can make a Nobel-worthy breakthrough with just your laptop! The data you need is freely available. You just have to discover the hidden relationships in it.

The field of dimensionality reduction is all about discovering non-linear, non-local relationships in data that are not obvious in the original feature space. If we reduce the number of dimensions in some data, we can visualise it, because a projection in 2D or 3D space can be plotted really easily, unless you use PHP. Training a model on a dataset with many dimensions usually costs a lot of time and memory. It also often leads to overfitting. Not all the features we have available to us are relevant to our problem. If we reduce the dimensions, we can reduce the noise - the unnecessary parts of the data - and find the features that are, surprisingly, very closely related. And once in a smaller subspace, we can more easily apply simple learning algorithms to it.

We can divide dimensionality reduction into two different topics: feature selection and feature extraction.
Selection is all about finding the most relevant features to a problem. It can be based on our intuition, as in: which features do we think would be the most relevant? Or we could let a model find the best features by itself. *angelic singing* Extraction means finding new features after transforming the data from a high-dimensional space to a lower-dimensional space. We'll perform the latter with a technique called principal component analysis (PCA).

Our dataset is going to be a record of the seventeen types of food that the average person consumes per week, for every country in the UK. So we've got seventeen features, or dimensions. Let's see what kind of insights we can get from this data.

PCA transforms variables into a new set of variables, each of which is a linear combination of the original variables. These new variables are known as principal components. PCA is an orthogonal, linear transformation that maps data to a new coordinate system such that the greatest variance of any projection of the data lies on the first principal component, the second greatest variance on the second component, and so on. The variance is the measure of how spread out the data is. If I were to measure the variance of the height of a team of basketball players, it would be pretty low. But if I added a group of primary school children to the mix, the variance would be pretty high.

Our first step is to standardise the data. PCA is a variance-maximising exercise: it projects the original data onto a direction which maximises variance. If we graph the amount of variance explained by the different principal components of a small dataset, it'll seem like only one component explains all the variance in the data - like Putin at the G Summit. But if we standardise the data first, then we'll see that the other components do indeed contribute to the variance as well. Standardising means putting the data on the same unit scale.
For us, that would be grams for everything, not a combination of kilograms and grams. It also means the data should have a mean of zero and a variance of one. A mean is just the average value of all the x's in the set X, which we can find by dividing the sum of all the data points by the number of them. The way we calculate variance is by computing the standard deviation, squared. The standard deviation is the square root of the average squared distance of the data points from the mean. It's used to tell how measurements for a group are spread out from the mean.

Once our data is standardised, we're going to perform eigendecomposition. So if your mumma is so fat she's not embeddable in ' space, eigenpairs will help fix that. Eigen is a German word that roughly translates to characteristic. And in linear algebra, an eigenvector is a vector that doesn't change its direction under the associated linear transformation.

*sick beats* *sick beats* Decompose my soul and you'll find an eigenpair *sick beats* *sick beats* Decompose my soul: a vector, value pair *sick beats* Decompose my soul

Another way of putting it is that if we have a non-zero vector v, then it's an eigenvector of a square matrix A if Av is a scalar multiple of v, that is, Av = λv. The scalar λ (lambda) is the eigenvalue, or characteristic value, associated with the eigenvector v. Eigenvalues are the coefficients attached to eigenvectors that give the axes magnitude. If we had a shear mapping that displaced every point in a fixed direction, notice how the red arrow changes direction, but the blue arrow doesn't! The blue arrow is an eigenvector of the mapping because it doesn't change direction, and since its length is unchanged, its eigenvalue is 1.

Both terms are important in many fields, especially physics, since they can help measure the stability of rotating bodies and the oscillations of vibrating systems. Many problems can be modelled with linear transformations, and eigenvectors give very simple solutions.
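The shear example can be checked numerically. The following is a minimal sketch in plain Python, not something shown in the video: the shear factor of 1.0, the helper names `apply` and `is_eigenvector`, and the choice of which vector plays the "blue" and "red" arrow are all assumptions made for illustration.

```python
# Horizontal shear: every point (x, y) moves to (x + k*y, y).
# The shear factor k = 1.0 is an arbitrary choice for illustration.
SHEAR = [[1.0, 1.0],
         [0.0, 1.0]]

def apply(matrix, vector):
    """Multiply a 2x2 matrix by a 2-element vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def is_eigenvector(matrix, vector, tol=1e-9):
    """Return (True, eigenvalue) if matrix * vector is a scalar multiple of vector.

    Assumes `vector` is non-zero.
    """
    out = apply(matrix, vector)
    # In 2D, two vectors are parallel exactly when their cross product is zero.
    cross = vector[0] * out[1] - vector[1] * out[0]
    if abs(cross) > tol:
        return False, None
    # Read the scalar off the larger component so we never divide by zero.
    i = 0 if abs(vector[0]) >= abs(vector[1]) else 1
    return True, out[i] / vector[i]

blue = [1.0, 0.0]  # points along the shear direction (hypothetical "blue arrow")
red = [0.0, 1.0]   # points across the shear direction (hypothetical "red arrow")

print(is_eigenvector(SHEAR, blue))  # (True, 1.0)
print(is_eigenvector(SHEAR, red))   # (False, None)
```

The vector along the shear direction comes back unchanged, so it is an eigenvector with eigenvalue 1, while the other vector gets tilted and is not an eigenvector at all.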
There are way too many dimensions in this painting. There definitely are, but speaking of both, I prefer PCA. Why? Several reasons but, I mean, PCA is deterministic, t-SNE isn't. So the same answer is guaranteed every time. And it makes data plottable on a D graph. Which-- So we could even paint it ourselves. I'm down to paint some D later! Let's datify a canvas!

If we had a system of linear differential equations, for example to model how the populations of two species, X and Y, affect one another's growth - like if one is the predator of the other - solving this system directly is complicated. But if we introduce two new variables, Z and W, which depend linearly on X and Y, we can decouple the system so that instead we are dealing with two independent functions. The eigenvectors and eigenvalues of the matrix of coefficients do just this. They decouple the ways in which a linear transformation acts into a number of independent actions along separate directions that can be dealt with independently.

So we will need to construct a covariance matrix. Then we will perform eigendecomposition on that matrix. A matrix is just a table of values; a covariance matrix is symmetric, so the table has the same headings across the top as it does down the side. It describes the variance of the data and the covariance among the variables. Covariance is the measure of how two variables change with respect to each other. It's positive when the variables move in the same direction, and negative when they move in opposite directions. PCA tries to draw straight lines through data, like linear regression. Each straight line is a principal component: a relationship between an independent and dependent variable. The number of principal components equals the number of dimensions in the data, and PCA's job is to prioritise them. If two variables change together, it's very likely because one is acting on the other or they're both subject to the same hidden force.
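To make the covariance matrix concrete, here is a small sketch in plain Python. The three feature columns are made-up toy numbers, not the UK food data, and the function names are hypothetical. It uses the sample covariance (dividing by n - 1), which is one common convention and an assumption here.

```python
def mean(values):
    """Average of a list of numbers."""
    return sum(values) / len(values)

def covariance(xs, ys):
    """Sample covariance of two equally long columns (divides by n - 1)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def covariance_matrix(columns):
    """Square table whose entry [i][j] is the covariance of column i with column j."""
    return [[covariance(a, b) for b in columns] for a in columns]

# Made-up toy data: three features measured on four samples.
up = [1.0, 2.0, 3.0, 4.0]
also_up = [2.0, 4.0, 6.0, 8.0]   # rises together with `up`
down = [4.0, 3.0, 2.0, 1.0]      # falls as `up` rises

cov = covariance_matrix([up, also_up, down])
for row in cov:
    print(row)
```

The diagonal entries are the variances of each feature. The entry pairing `up` with `also_up` is positive, the entry pairing `up` with `down` is negative, and the table is symmetric because the covariance of a with b equals the covariance of b with a.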
Performing eigendecomposition on our covariance matrix helps us find the hidden forces at work in our data, since we can't eyeball inter-variable relationships in high-dimensional space. When calculating the covariance matrix, we use a mean vector in which each value is the sample mean of one feature column in the dataset.

Once we have our eigenpairs, we want to select the principal components. We need to decide which ones can be dropped - and that's where our eigenvalues come in. We'll rank the eigenvalues from highest to lowest. The lowest ones bear the least info about the distribution of the data, so we can drop a number of them like they're... cold.

Next we will construct a projection matrix. This is just a matrix of our concatenated top k eigenvectors. We can choose how many dimensions we want for our subspace by choosing that number of eigenvectors to construct our d-by-k eigenvector matrix, W. Lastly we will use this projection matrix to transform our samples onto the subspace via a simple dot product operation.

If we project our data onto a one-dimensional space, then we can already see something interesting. Notice how Northern Ireland is a major outlier. It makes sense, according to the data, that the Northern Irish consume way more potatoes and alcohol and way too few healthy options. The same thing happens if we graph both components: we can see relations between data points that we wouldn't otherwise.

To summarise, principal component analysis is a technique that transforms a dataset onto a lower-dimensional subspace so we can visualise and find hidden relationships in it. The principal components are eigenvectors coupled with eigenvalues. They describe the directions in the original feature space with the greatest variance in the data. And the variance is a measure of how spread out some data is.

The winner of last week's coding challenge is Ong Ja Rui.
He implemented a self-organising feature map for colours and for handwritten digits. Really efficient code and well documented. Great job Ong, Wizard of the Week. And the runner-up is Hammad Shaikh, who developed a super detailed Jupyter notebook on self-organising maps for class size effects on students.

This week's coding challenge is to perform PCA from scratch on a dataset of your choice. Post your GitHub link in the comments and I'll give the winners a shoutout next week. *outro music* Please subscribe for more programming videos *outro music* and for now, I've got to release a music video. *outro music* So, thanks for watching!
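As a possible starting point for the challenge, the steps described in this transcript (standardise, build the covariance matrix, eigendecompose it, rank the eigenpairs, build the projection matrix W, project with a dot product) can be sketched with NumPy. This is a minimal sketch under assumptions, not the code from the video: the function name `pca`, the randomly generated toy data, and the choice of `ddof=1` are all illustrative, and it assumes no feature column is constant (which would cause a division by zero when standardising).

```python
import numpy as np

def pca(X, k):
    """Project X (n_samples x d features) onto its top-k principal components.

    Returns the projected data and the share of variance explained by
    every component, ranked from highest to lowest.
    """
    # 1. Standardise: zero mean and unit variance for each feature column.
    X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # 2. Covariance matrix of the features (d x d, symmetric).
    cov = np.cov(X_std, rowvar=False)
    # 3. Eigendecomposition; eigh is intended for symmetric matrices.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # 4. Rank the eigenpairs from highest to lowest eigenvalue.
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    # 5. Projection matrix W (d x k): the top-k eigenvectors as columns.
    W = eigenvectors[:, :k]
    # 6. Transform the samples onto the subspace with a dot product.
    return X_std @ W, eigenvalues / eigenvalues.sum()

# Made-up toy data: 6 samples, 3 features (NOT the UK food dataset).
# The second feature is nearly a multiple of the first, so most of the
# variance should end up on the first component.
rng = np.random.default_rng(0)
base = rng.normal(size=(6, 1))
X = np.hstack([
    base,
    2 * base + rng.normal(scale=0.1, size=(6, 1)),
    rng.normal(size=(6, 1)),
])

projected, explained = pca(X, k=2)
print(projected.shape)  # (6, 2)
print(explained)        # share of variance per component, highest first
```

Swapping the toy array for a real dataset (rows as samples, columns as features) and plotting the two columns of `projected` against each other gives the kind of 2D view discussed above.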